The efficacy of particle identification is compared using artificial neural networks and boosted
decision trees. The comparison is performed in the context of MiniBooNE, an experiment at
Fermilab searching for neutrino oscillations. Based on studies of Monte Carlo samples of simulated 
data, particle identification with boosting algorithms has better performance than that with artificial 
neural networks for the MiniBooNE experiment. Although the tests in this paper were for one 
experiment, it is expected that boosting algorithms will find wide application in physics. 

I. INTRODUCTION



The artificial neural network (ANN) technique has 
been widely used in data analysis of High Energy Physics 
experiments in the last decade. The use of the ANN 
technique usually gives better results than the traditional 
simple-cut techniques. In this paper, another data
classification technique, boosting, is introduced for data
analysis in the MiniBooNE experiment [1] at Fermi National
Accelerator Laboratory. The MiniBooNE experiment is
designed to confirm or refute the evidence for $\nu_\mu \to \nu_e$
oscillations at $\Delta m^2 \sim 1\ \mathrm{eV}^2/c^4$ found by the LSND
experiment [2]. It is a crucial experiment which will
imply new physics beyond the standard model if the LSND
signal is confirmed. Based on our studies, particle
identification (PID) with the boosting algorithm is 20 to 80%
better than that with our standard ANN PID technique;
the boosting performance relative to that of the ANN
depends on the Monte Carlo samples and PID variables.
Although the boosting algorithm was tested in only one
experiment, it is anticipated to have wide application in
physics, especially in data analysis of particle physics
experiments for separating signal and background events.

The boosting algorithm is one of the most powerful 
learning techniques introduced during the past decade. 
The boosting algorithm is a procedure that combines 
many "weak" classifiers to achieve a final powerful clas- 
sifier. Boosting can be applied to any classification 
method. In this paper, it is applied to decision trees. 
Two boosting algorithms, AdaBoost [3] and ε-Boost [4],
are considered. A brief description of boosting algorithms 
is given in the next section. Our results are presented in 
Section III, while we summarize our conclusions in Sec- 
tion IV. 



II. BRIEF DESCRIPTION OF BOOSTING 
A. Decision Tree 

Suppose one is trying to divide events into signal and 
background and suppose Monte Carlo samples of each 
are available. Divide each Monte Carlo sample into two 
parts. The first part, the training sample, will be used 
to train the decision tree, and the second part, the test 
sample, to test the final classifier after training. 

For each event, suppose there are a number of PID vari- 
ables useful for distinguishing between signal and back- 
ground. Firstly, for each PID variable, order the events 
by the value of the variable. Then pick variable one and 
for each event value see what happens if the training sam- 
ple is split into two parts, left and right, depending on 
the value of that variable. Pick the splitting value which 
gives the best separation into one side having mostly sig- 
nal and the other mostly background. Then repeat this 
for each variable in turn. Select the variable and split- 
ting value which gives the best separation. Initially there 
was a sample of events at a "node". Now there are two
samples called "branches". For each branch, repeat the
process, i.e., again try each value of each variable for the 
events within that branch to find the best variable and 
splitting point for that branch. One keeps splitting until
a given number of final branches, called leaves, are
obtained, or until each leaf is pure signal or pure back- 
ground, or has too few events to continue. This descrip- 
tion is a little oversimplified. In fact at each stage one 
picks as the next branch to split, the branch which will 
give the best increase in the quality of the separation. A 
schematic of a decision tree is shown in Fig. 1, in which
3 variables are used for signal/background separation: 
event hit multiplicity, energy, and reconstructed radial 
position. 

FIG. 1: Schematic of a decision tree. S for signal, B for
background. Terminal nodes (called leaves) are shown in boxes.
If signal events are dominant in a leaf, then that leaf is a
signal leaf; otherwise, it is a background leaf.

What criterion is used to define the quality of separation
between signal and background in the split? Imagine
the events are weighted, with each event having weight
$W_i$. Define the purity of the sample in a branch by

$$P = \frac{\sum_s W_s}{\sum_s W_s + \sum_b W_b},$$

where $\sum_s$ is the sum over signal events and $\sum_b$ is the
sum over background events. Note that $P(1 - P)$ is 0
if the sample is pure signal or pure background. For a
given branch let

$$\mathrm{Gini} = \left(\sum_{i=1}^{n} W_i\right) P (1 - P),$$

where $n$ is the number of events on that branch. The
criterion chosen is to minimize

$$\mathrm{Gini}_{\mathrm{left\ son}} + \mathrm{Gini}_{\mathrm{right\ son}}.$$

To determine the increase in quality when a node is
split into two branches, one maximizes

$$\mathrm{Criterion} = \mathrm{Gini}_{\mathrm{father}} - \mathrm{Gini}_{\mathrm{left\ son}} - \mathrm{Gini}_{\mathrm{right\ son}}.$$

At the end, if a leaf has purity greater than 1/2 (or
whatever is set), then it is called a signal leaf and if the 
purity is less than 1/2, it is a background leaf. Events 
are classified signal if they land on a signal leaf and back- 
ground if they land on a background leaf. The resulting 
tree is a decision tree. 
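
The split search described above can be sketched for a single PID variable as follows. This is a minimal illustration, not the MiniBooNE implementation: the function names `gini` and `best_split` are hypothetical, cut values are placed midway between neighbouring event values (an assumption), and the gain is taken as the Gini of the parent node minus the summed Gini of the two branches, which is equivalent to minimizing the summed Gini of the branches.

```python
from typing import List, Optional, Sequence, Tuple


def gini(weights: Sequence[float], is_signal: Sequence[bool]) -> float:
    """Weighted Gini of a branch: (sum of weights) * P * (1 - P)."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    signal = sum(w for w, s in zip(weights, is_signal) if s)
    purity = signal / total
    return total * purity * (1.0 - purity)


def best_split(
    values: Sequence[float],
    weights: Sequence[float],
    is_signal: Sequence[bool],
) -> Tuple[Optional[float], float]:
    """Scan every cut on one variable; return (cut value, Gini gain).

    Returns (None, 0.0) when no cut improves on the parent node.
    """
    parent = gini(weights, is_signal)
    # Order the events by the value of the variable.
    order: List[int] = sorted(range(len(values)), key=lambda i: values[i])
    best_cut: Optional[float] = None
    best_gain = 0.0
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no cut can separate two equal values
        left, right = order[:k], order[k:]
        g_left = gini([weights[i] for i in left], [is_signal[i] for i in left])
        g_right = gini([weights[i] for i in right], [is_signal[i] for i in right])
        gain = parent - g_left - g_right
        if gain > best_gain:
            best_cut, best_gain = 0.5 * (lo + hi), gain
    return best_cut, best_gain
```

To grow a tree, one would call `best_split` for every variable on every candidate branch and split the branch and variable with the largest gain, stopping at the desired number of leaves.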

Decision trees have been available for some time.
They are known to be powerful but unstable, i.e., a small 
change in the training sample can give a large change in 
the tree and the results. 

There are three major measures of node impurity used
in practice: misclassification error, the Gini index and
the cross-entropy. If we define $p$ as the proportion of
the signal in a node, then the three measures are
$1 - \max(p, 1-p)$ for the misclassification error,
$2p(1-p)$ for the Gini index and
$-p\log(p) - (1-p)\log(1-p)$ for the cross-entropy.
The three measures are similar, but the Gini
index and the cross-entropy are differentiable, and hence
more amenable to numerical optimization. In addition,
the Gini index and the cross-entropy are more sensitive
to changes in the node probabilities than the
misclassification error. The Gini index and the
cross-entropy behave similarly.
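
For a quick numerical comparison, the three impurity measures can be written as short functions of the signal proportion p. This is only an illustrative sketch; the function names are hypothetical, and the natural logarithm is assumed for the cross-entropy.

```python
import math


def misclassification_error(p: float) -> float:
    """1 - max(p, 1 - p) for signal proportion p."""
    return 1.0 - max(p, 1.0 - p)


def gini_index(p: float) -> float:
    """2 p (1 - p) for signal proportion p."""
    return 2.0 * p * (1.0 - p)


def cross_entropy(p: float) -> float:
    """-p log(p) - (1 - p) log(1 - p), taken as 0 for a pure node."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


if __name__ == "__main__":
    # Tabulate the three measures from a fully mixed node to a pure node.
    for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(p, misclassification_error(p), gini_index(p), cross_entropy(p))
```

All three measures vanish for a pure node and are largest at p = 0.5.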



B. Boosting 

Within the last few years a great improvement has
been made to the decision tree method. Start with
unweighted events and build a tree as above. If a training
event is misclassified, i.e., a signal event lands on a
background leaf or a background event lands on a signal
leaf, then the weight of that event is increased (boosted).

A second tree is built using the new weights, no longer 
equal. Again misclassified events have their weights 
boosted and the procedure is repeated. Typically, one 
may build 1000 or 2000 trees this way. 

A score is now assigned to an event as follows. The 
event is followed through each tree in turn. If it lands 
on a signal leaf it is given a score of 1 and if it lands on 
a background leaf it is given a score of -1. The renor- 
malized sum of all the scores, possibly weighted, is the 
final score of the event. High scores mean the event is 
most likely signal and low scores that it is most likely 
background. By choosing a particular value of the score 
on which to cut, one can select a desired fraction of the 
signal or a desired ratio of signal to background. For 
those familiar with ANNs, the use of this score is the 
same as the use of the ANN value for a given event. For 
the MiniBooNE experiment, boosting has been found to 
be superior to ANNs. Statisticians and computer scien- 
tists have found that this method of classification is very 
efficient and robust. Furthermore, the amount of tuning 
needed is rather modest compared with ANNs. It works 
well with many PID variables. If one makes a monotonic
transformation of a variable, so that if $x_1 > x_2$ then
$f(x_1) > f(x_2)$, the boosting method gives exactly the
same results. It depends only on the ordering according
to the variable, not on the value of the variable.
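
The invariance under monotonic transformations can be checked directly: the set of ways a cut on one variable can divide the events depends only on the ordering of the events. The sketch below is an illustration with made-up values; `partitions` is a hypothetical helper, not part of any boosting package.

```python
import math
from typing import FrozenSet, List, Sequence


def partitions(values: Sequence[float]) -> List[FrozenSet[int]]:
    """Event-index sets that can end up on the low side of a cut."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [
        frozenset(order[:k])
        for k in range(1, len(order))
        if values[order[k - 1]] != values[order[k]]
    ]


if __name__ == "__main__":
    x = [0.5, 3.0, 1.2, 8.0, 2.2]          # made-up variable values
    log_x = [math.log(v) for v in x]        # monotonic transformation
    folded = [(v - 2.0) ** 2 for v in x]    # not monotonic
    print(partitions(x) == partitions(log_x))   # same candidate splits
    print(partitions(x) == partitions(folded))  # ordering changed
```

Because every candidate split is the same before and after the monotonic transformation, a tree built on the transformed variable selects the same events at every node.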

In articles on boosting within the statistics and
computer science communities, it is often recommended that
short trees with eight leaves or so be used. For the
MiniBooNE Monte Carlo samples it was found that large trees
with 45 leaves worked significantly better.



C. Some Boosting Algorithms 

If there are $N$ total events in the sample, the weight of
each event is initially taken as $1/N$. Suppose that there
are $N_{tree}$ trees and $m$ is the index of an individual tree.
Let

• $x_i$ = the set of PID variables for the $i$th event.

• $y_i = 1$ if the $i$th event is a signal event and $y_i = -1$
if the event is a background event.

• $W_i$ = the weight of the $i$th event.

• $T_m(x_i) = 1$ if the set of variables for the $i$th event
lands that event on a signal leaf and $T_m(x_i) = -1$
if the set of variables for that event lands it on a
background leaf.

• $I(y_i \neq T_m(x_i)) = 1$ if $y_i \neq T_m(x_i)$ and 0 if
$y_i = T_m(x_i)$.


There are at least two commonly used methods for boost- 
ing the weights of the misclassified events in the training 
sample. 

The first boosting method is called AdaBoost [3]. Define
for the $m$th tree:

$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} W_i\, I(y_i \neq T_m(x_i))}{\sum_{i=1}^{N} W_i},$$

$$\alpha_m = \beta \times \ln\left((1 - \mathrm{err}_m)/\mathrm{err}_m\right).$$

$\beta = 1$ is the value used in the standard AdaBoost
method. For the MiniBooNE Monte Carlo samples,
$\beta = 0.5$ has been found to give better results. Change
the weight of each event $i$, $i = 1, \ldots, N$:

$$W_i \to W_i \times e^{\alpha_m I(y_i \neq T_m(x_i))}.$$

Each classifier $T_m$ is required to be better than random
guessing with respect to the weighted distribution upon
which the classifier is trained. Thus, $\mathrm{err}_m$ is required to
be less than 0.5, since, otherwise, the weights would be
updated in the wrong direction. Next, renormalize the
weights, $W_i \to W_i / \sum_{i=1}^{N} W_i$. The score for a given event
is

$$T(x) = \sum_{m=1}^{N_{tree}} \alpha_m T_m(x),$$

which is just the weighted sum of the scores over the
individual trees, see Fig. 2.

FIG. 2: Schematic of a boosting procedure.
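
One AdaBoost iteration can be sketched as below, assuming the standard AdaBoost definitions: the weighted misclassification rate of the tree, a tree weight of β times ln((1 - err)/err), and multiplication of the weight of each misclassified event by the exponential of that tree weight. The function names are hypothetical, and the tree outputs are supplied as plain lists of +1/-1 rather than produced by a trained tree. Rejecting err = 0 is an added assumption, since the tree weight would otherwise be infinite.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_step(
    weights: Sequence[float],
    y: Sequence[int],
    tree_output: Sequence[int],
    beta: float = 0.5,
) -> Tuple[float, float, List[float]]:
    """Return (alpha_m, err_m, renormalized weights) for one tree.

    y and tree_output hold +1 (signal) or -1 (background) per event.
    """
    missed = [yi != ti for yi, ti in zip(y, tree_output)]
    err = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    if not 0.0 < err < 0.5:
        # The tree must beat random guessing; err = 0 gives infinite alpha.
        raise ValueError("expected 0 < err_m < 0.5")
    alpha = beta * math.log((1.0 - err) / err)
    # Boost only the misclassified events, then renormalize.
    boosted = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    total = sum(boosted)
    return alpha, err, [w / total for w in boosted]


def adaboost_score(alphas: Sequence[float], tree_outputs: Sequence[int]) -> float:
    """Weighted sum of the +1/-1 outputs of all trees for one event."""
    return sum(a * t for a, t in zip(alphas, tree_outputs))
```

In a full training loop, each new tree would be built with the weights returned by the previous call, and the returned tree weights would be stored for scoring.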

The second boosting method is called ε-Boost [4], or
sometimes "shrinkage". After the $m$th tree, change the
weight of each event $i$, $i = 1, \ldots, N$:

$$W_i \to W_i\, e^{2\epsilon I(y_i \neq T_m(x_i))},$$

where $\epsilon$ is a constant of the order of 0.01. Renormalize
the weights, $W_i \to W_i / \sum_{i=1}^{N} W_i$. The score for a given
event is

$$T(x) = \sum_{m=1}^{N_{tree}} \epsilon\, T_m(x),$$

which is the renormalized, but unweighted, sum of the
scores over individual trees.
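
A corresponding sketch of one ε-Boost iteration follows. It assumes that the weight of each misclassified event is multiplied by exp(2ε); the factor 2ε in the exponent should be read as an assumption of this sketch, as should the hypothetical function names.

```python
import math
from typing import List, Sequence


def epsilon_boost_step(
    weights: Sequence[float],
    y: Sequence[int],
    tree_output: Sequence[int],
    epsilon: float = 0.01,
) -> List[float]:
    """Boost misclassified events by exp(2 * epsilon), then renormalize."""
    factor = math.exp(2.0 * epsilon)
    boosted = [
        w * factor if yi != ti else w
        for w, yi, ti in zip(weights, y, tree_output)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]


def epsilon_boost_score(tree_outputs: Sequence[int], epsilon: float = 0.01) -> float:
    """Unweighted sum of the +1/-1 tree outputs, scaled by epsilon."""
    return epsilon * sum(tree_outputs)
```

Unlike AdaBoost, every tree changes the weights by the same small amount, so many iterations are needed before the weights differ appreciably.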

The AdaBoost and ε-Boost algorithms used in this
paper try to minimize the expectation value $E(e^{-yF(x)})$,
where $y = 1$ for signal, $y = -1$ for background, and
$F(x) = \sum_{i=1}^{N_{tree}} f_i(x)$, where the classifier
$f_i(x) = 1$ if an event lands on a signal leaf, and
$f_i(x) = -1$ if an event lands on a background leaf. This
minimization is closely related to minimizing the binomial
log-likelihood. It can be shown that $E(e^{-yF(x)})$ is
minimized at

$$F(x) = \frac{1}{2}\ln\frac{P(y=1|x)}{P(y=-1|x)} = \frac{1}{2}\ln\frac{p(x)}{1-p(x)}.$$

Let $y^* = (y + 1)/2$. It is then easy to show that

$$e^{-yF(x)} = \frac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}}.$$

The right-hand side is known as the $\chi$ statistic. $\chi^2$ is
a quadratic approximation to the log-likelihood, so $\chi$
can be considered a gentler alternative. It turns out
that fitting using $\chi$ is monotone and smooth; the
criteria will continually drive the estimates towards purer
solutions. An ANN tries to minimize the squared error
$E(y - F(x))^2$, where $y = 1$ for signal events, $y = 0$ for
background events, and $F(x)$ is the network prediction
for training events.
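
The minimization statement can be checked numerically at a fixed x. With signal probability p, the expectation of exp(-yF) is p exp(-F) + (1 - p) exp(F), which is smallest at F = (1/2) ln(p/(1 - p)); at that F, exp(-yF) equals |y* - p| / sqrt(p(1 - p)). The sketch below uses hypothetical function names and an arbitrary p.

```python
import math


def expected_exp_loss(f: float, p: float) -> float:
    """E[exp(-y F)] at fixed x, with P(y = 1 | x) = p."""
    return p * math.exp(-f) + (1.0 - p) * math.exp(f)


def optimal_f(p: float) -> float:
    """Minimizer of the expected exponential loss, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return 0.5 * math.log(p / (1.0 - p))


def chi(y: int, p: float) -> float:
    """|y* - p| / sqrt(p (1 - p)) with y* = (y + 1) / 2."""
    y_star = (y + 1) / 2
    return abs(y_star - p) / math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    p = 0.8  # arbitrary signal probability for the check
    f = optimal_f(p)
    # The loss at the optimum is lower than at nearby values of F.
    print(expected_exp_loss(f - 0.1, p), expected_exp_loss(f, p),
          expected_exp_loss(f + 0.1, p))
    # exp(-y F) agrees with the chi statistic for both classes.
    print(math.exp(-f), chi(1, p), math.exp(f), chi(-1, p))
```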

III. RESULTS

For the $\nu_\mu \to \nu_e$ oscillation search in the MiniBooNE
experiment the main backgrounds come from intrinsic
$\nu_e$ contamination in the beam, mis-identified $\nu_\mu$
quasi-elastic scattering and mis-identified neutral current
$\pi^0$ production. Since intrinsic $\nu_e$ events are real $\nu_e$
events, the PID variables cannot distinguish them from
oscillation $\nu_e$ events. This report concentrates on separating
the non-$\nu_e$ events from the $\nu_e$ events. Good sensitivity
for the $\nu_e$ appearance search requires low background
contamination from all kinds of backgrounds. Here, the
ANN and the two boosting algorithms are used to separate
$\nu_e$ charged current quasi-elastic (CCQE) events from
non-$\nu_e$ background events.

500000 Monte Carlo $\nu_\mu$ events distributed among
the many possible final states and 200000 intrinsic $\nu_e$
CCQE events were fed into the reconstruction package
R-fitter. Among these events, 88233 intrinsic $\nu_e$ CCQE
and 162657 background events passed reconstruction and
pre-selection cuts.

The signature of each event is given by 52 variables 
for the R-fitter. All variables are used in the boosting 
algorithms for training and testing. It is a challenge to 
have agreement between data and Monte Carlo for all 
of the PID variables and for the boosting outputs. The 
MiniBooNE Collaboration is devoting considerable effort 
to achieve it. Monte Carlo samples using 18 different pa- 
rameter sets have been generated and run through the 
same reconstruction programs. The results for both the 
PID variables and the boosting outputs are consistent. 
When the present Monte Carlo is compared with the real 
data samples, the shapes of the various PID variables and 
the boosting outputs match well. Since the reconstruction
and PID algorithms are still undergoing continuous
modifications, relative results rather than absolute
percentages are presented in the following plots.

For the AdaBoost algorithm, the parameter $\beta = 0.5$,
the number of leaves $N_{leaves} = 45$ and the number of
tree iterations $N_{tree} = 1000$ were used. The relative
ratio (defined as the number of background events kept
divided by the number kept for 50% intrinsic $\nu_e$ selection
efficiency and $N_{tree} = 1000$) as a function of $\nu_e$
selection efficiency for various tree iterations is shown in the
top plot of Fig. 3 and the AdaBoost output distributions
are shown in the bottom plot. 20000 intrinsic $\nu_e$ CCQE
signal and 30000 background events were used for training;
68233 $\nu_e$ and 132657 background events were used
for testing. All results shown in the paper are for testing
samples.

FIG. 3: Top: the number of background events kept divided
by the number kept for 50% intrinsic $\nu_e$ selection efficiency
and $N_{tree} = 1000$ versus the intrinsic $\nu_e$ CCQE selection
efficiency. Bottom: AdaBoost output. All kinds of backgrounds
are combined for the boosting training.

In order to quantify the performance of the boosting
algorithm, the AdaBoost results for a particular set of
PID variables were compared with ANN results. The
results, compared as a function of the intrinsic $\nu_e$ CCQE
selection efficiency, are shown in Fig. 4. For intrinsic
$\nu_e$ signal efficiencies ranging from 40% to 60%, the
performance of AdaBoost was better than that of the ANN
by factors of approximately 1.5 and 1.8 when trained on the
signal and all kinds of backgrounds with 21 (red dots)
and 52 (black boxes) input variables respectively, as shown
in Fig. 4.a. When AdaBoost and the ANN were trained on the
signal and neutral current $\pi^0$ background, the performance
of AdaBoost was better than that of the ANN by factors of
approximately 1.3 and 1.6 for 22 (red dots) and 52 (black
boxes) training variables respectively, as shown in Fig. 4.b.
The best results for the ANN were found with 22 variables,
while the best results for boosting were found with
52 variables. Comparison of the best ANN results and
the best boosting results indicates that, when trained on
the signal and neutral current $\pi^0$ background, the ANN
kept approximately 1.5 times more background
events than were kept by the boosting algorithms at
about 50% $\nu_e$ CCQE efficiency.
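
The comparison quantity used here, the background kept by one classifier divided by the background kept by another at the same signal efficiency, can be sketched as follows. The function names are hypothetical, and the convention that an event is kept when its score is at or above the cut is an assumption; with tied scores slightly more than the target fraction of signal can pass.

```python
from typing import Sequence


def background_kept(
    signal_scores: Sequence[float],
    background_scores: Sequence[float],
    signal_eff: float,
) -> int:
    """Background events passing the cut that keeps signal_eff of the signal."""
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(signal_eff * len(ranked)))
    cut = ranked[n_keep - 1]  # lowest score among the kept signal events
    return sum(1 for s in background_scores if s >= cut)


def relative_ratio(
    sig_a: Sequence[float],
    bkg_a: Sequence[float],
    sig_b: Sequence[float],
    bkg_b: Sequence[float],
    signal_eff: float,
) -> float:
    """Background kept by classifier A divided by that kept by classifier B."""
    kept_b = background_kept(sig_b, bkg_b, signal_eff)
    if kept_b == 0:
        raise ZeroDivisionError("classifier B keeps no background at this cut")
    return background_kept(sig_a, bkg_a, signal_eff) / kept_b
```

With A taken as the ANN and B as AdaBoost, a ratio above 1 means the ANN keeps more background than AdaBoost at that signal efficiency.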

In Fig. 4.c, the ratio of the background kept for a 52
variable AdaBoost to that for a 21 (red dots - results for
AdaBoost trained by the signal and all kinds of
backgrounds) / 22 (black boxes - results for AdaBoost trained
by the signal and neutral current $\pi^0$ background)
variables is shown as a function of $\nu_e$ efficiency. It can be
seen that the AdaBoost performance is improved by the
use of more training variables.

FIG. 4: Comparison of ANN and AdaBoost performance for
test samples. Relative ratio (defined as the number of
background events kept for ANN divided by the events kept for
AdaBoost) versus the intrinsic $\nu_e$ CCQE selection efficiency.
a) All kinds of backgrounds are combined for the training
against the signal. b) Trained by signal and neutral current
$\pi^0$ background. c) Relative ratio is re-defined as the number of
background events kept for AdaBoost with 21 (red) / 22 (black)
training variables divided by that for AdaBoost with 52
training variables. All error bars shown in the figures are for
Monte Carlo statistical errors only.

The above ANN and AdaBoost performance comparison
with different input variables indicates that AdaBoost
can improve the PID performance significantly by
using more input variables, even though many of them
have weak discriminant power; the ANN, however, seems
unlikely to make full use of all input variables because it is
more difficult to optimize all the weights between ANN
nodes, given more nodes in both the input and the hidden
layers. For the MiniBooNE Monte Carlo samples, the
ANN is optimal for approximately 20 PID variables.
The authors have found a similar number to be true for
several other applications. In general, the optimum number
for an ANN may vary depending on the strength of the
PID variables and the correlations between them.

Further evidence of this effect comes from the S-fitter,
a second reconstruction-PID program set for
MiniBooNE. A systematic attempt was made to find
the optimum sets of variables for ANN and for boosting
classifiers by using $\nu_e$ CCQE signal and $\pi^0$
background (which includes 25 NUANCE reaction channels).
It is found that, for S-fitter, the optimum ANN result is
achieved by a selected set of 22 variables, while for
boosting, no obvious improvement is seen after a selected
optimum set of 50 variables is used. Comparison of the best
ANN results and the best boosting results indicates that,
for a given fraction of $\nu_e$ CCQE events kept, the ANN
results kept about 1.2 times more $\pi^0$ background events
than were kept by the boosting algorithms within the target
range of keeping close to 50% of the $\nu_e$ CCQE events.

FIG. 5: Comparison of AdaBoost and ε-Boost performance
with different decision tree sizes (8 and 45 leaves per decision
tree) versus the intrinsic $\nu_e$ CCQE selection efficiency. a)
Relative ratio is defined as the number of background events
kept for decision trees of 8 leaves divided by that for decision
trees of 45 leaves; red dots with error bars represent results
from AdaBoost and black boxes with error bars those from
ε-Boost. The tree iterations were 10000 for 8 leaves/tree and
1800 for 45 leaves/tree, respectively. b) Relative ratio here
is the number of background events kept for AdaBoost divided
by that for ε-Boost with $N_{leaves} = 45$. The performance
comparisons of AdaBoost and ε-Boost with different tree
iterations are shown in different colors: $N_{tree}$ = 100 (black),
200 (cyan), 500 (magenta), 1000 (yellow), 2000 (blue), 5000 (red).

As noted in the introduction, two boosting algorithms
are considered in the present paper. The comparison of
AdaBoost and ε-Boost performance is shown in Fig. 5,
where parameters $\beta = 0.5$ and $\epsilon = 0.01$ were selected for
AdaBoost and ε-Boost training, respectively. The
comparison between small tree size (8 leaves) and large tree
size (45 leaves) with a comparable overall number of
decision leaves indicates that the large tree size with 45 leaves
yields 10 to 20% better performance for the MiniBooNE
Monte Carlo samples, as shown in Fig. 5.a. Increasing the
tree size past 45 leaves did not produce appreciable
improvement.

Comparison of AdaBoost and ε-Boost performance for
the background contamination versus the intrinsic $\nu_e$
CCQE selection efficiency as a function of the number
of decision tree iterations is shown in Fig. 5.b. A smaller
relative ratio implies a better performance for AdaBoost.
The performance of AdaBoost is better than that of
ε-Boost if the relative ratio is less than 1. Boosting
performance in the high signal efficiency region continues
to improve with more tree iterations. AdaBoost has better
performance than ε-Boost for fewer than about 200 tree
iterations, but becomes slightly worse than ε-Boost for
a large number of tree iterations, especially for signal
efficiency below about 60%. For higher $\nu_e$ signal efficiency
(> 70%), AdaBoost works slightly better than ε-Boost.

IV. CONCLUSIONS 

PID variables obtained using the R-fitter and the S- 
fitter event reconstruction programs for the MiniBooNE 
experiment were used to separate signal events from 
background events. The ANN and the boosting
algorithms were compared for PID. Based on these studies
with the MiniBooNE Monte Carlo samples, the boosting
algorithms, AdaBoost and ε-Boost, improved PID
performance significantly compared with the artificial
neural network technique. This improvement manifested
itself when a large number of PID variables was used.
For a small number of variables, the ANN classification 
was competitive, but as the number of variables was in- 
creased, the boosting results proved more efficient and 
superior to the ANN technique. If more variables are 
needed, boosting will use them as necessary. 



It was also found that boosting with a large tree size of 
45 leaves worked significantly better than boosting with a 
small tree size, 8 leaves, as recommended in some statis- 
tics literature. 

The boosting technique proved to be quite robust. If a
transformation of variables from $x$ to $y = f(x)$ is made,
then as long as the ordering is preserved, that is if
$x_2 > x_1$, then $y_2 > y_1$, the boosting results are unchanged.
ANNs must be tuned for temperature, learning rate and
other variables, while for boosting, there is much less to
vary and it is quite straightforward.

There are certainly applications where ANNs prove
better than boosting. However, for this application 
boosting appears superior and seems to be exceptionally 
robust and simple to use. It is anticipated that boosting 
techniques will have wide application in physics. 